{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Classification Methodology\n",
"### Machine Learning - Instituto de Computación - UdelaR\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Classification Methodology\n",
"\n",
"- Regardless of the method used, supervised learning (and classification in particular) involves a common set of stages.\n",
"\n",
"- What is our task? Given a set $X$ of independent instances drawn from some distribution $D$, each with an associated class $y$, we want to build a classification function that, given a new instance, returns its class.\n",
"\n",
"- Some questions: how do I learn the function? On which instances? How do I evaluate the function's performance?\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Phase 1: Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- We will assume that, in order to train a classifier, we start from a set $D = \\{(x_i,y_i)\\}$, called the training set, where each instance $x_i \\in \\mathbb{R}^n$ and $y_i \\in \\mathbb{R}$ (not every learning algorithm needs this input format; it is only meant to fix ideas).\n",
"\n",
"- (Un)fortunately, the datasets we usually have come from sensors, data entered by humans, different sources, etc. We therefore need to clean them (_data cleaning_).\n",
"\n",
"- The original data can come in many formats: elements of a set (categorical), dates, text, images, etc. We need to find ways to convert them into a format the algorithm accepts (_data transformation_)."
]
},
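{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of the cleaning and transformation steps described above, on a small made-up table (the column names and values are illustrative assumptions, not the Titanic data). A missing value is filled with the column mean, a binary category is mapped to 0/1, and a multi-valued category is one-hot encoded with `pd.get_dummies`. Mean imputation is only one of several possible choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one missing value and two categorical columns.\n",
"raw = pd.DataFrame({\n",
"    'age': [29.0, np.nan, 30.0, 25.0],\n",
"    'sex': ['female', 'male', 'male', 'female'],\n",
"    'pclass': ['1st', '3rd', '1st', '2nd'],\n",
"})\n",
"\n",
"# Data cleaning: fill the missing age with the mean of the observed ages.\n",
"clean = raw.copy()\n",
"clean['age'] = clean['age'].fillna(clean['age'].mean())\n",
"\n",
"# Data transformation: map the binary category to 0/1 and\n",
"# one-hot encode the multi-valued category.\n",
"clean['sex'] = clean['sex'].map({'female': 0, 'male': 1})\n",
"clean = pd.get_dummies(clean, columns=['pclass'], prefix='class')\n",
"\n",
"print(clean)\n"
]
},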
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Our data may come from different sources, so we need to integrate them (_data integration_).\n",
"\n",
"- To reduce training time, it may be necessary to aggregate data, remove attributes, or reduce the number of instances, while trying not to lose information (_data reduction_).\n",
"\n",
"\n",
"\n",
"Image source: [Data Preprocessing](https://medium.com/@silicon.smile1/data-preprocessing-b1552b4060f3) - Umar Farooq "
]
},
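{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A small sketch of integration and reduction, under stated assumptions: the two tables, their `id` key and the `fare` and `deck` columns are invented for illustration. The tables are joined with `DataFrame.merge`, and `VarianceThreshold` from `sklearn.feature_selection` is used as one possible way to drop attributes that carry no information because they are constant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_selection import VarianceThreshold\n",
"\n",
"# Hypothetical tables from two different sources sharing an 'id' key.\n",
"passengers = pd.DataFrame({'id': [1, 2, 3, 4],\n",
"                           'age': [29.0, 2.0, 30.0, 25.0]})\n",
"tickets = pd.DataFrame({'id': [1, 2, 3, 4],\n",
"                        'fare': [211.3, 151.6, 151.6, 151.6],\n",
"                        'deck': [1, 1, 1, 1]})\n",
"\n",
"# Data integration: join both sources on the shared key.\n",
"merged = passengers.merge(tickets, on='id', how='inner')\n",
"\n",
"# Data reduction: keep only attributes whose variance is above zero,\n",
"# which discards the constant 'deck' column.\n",
"features = merged.drop(columns=['id'])\n",
"selector = VarianceThreshold(threshold=0.0)\n",
"selector.fit(features)\n",
"reduced = features.loc[:, selector.get_support()]\n",
"\n",
"print(list(reduced.columns))\n"
]
},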
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Titanic\n",
"\n",
"We will work through an example using pandas (while we are at it, we also import other libraries we will probably use).\n",
"- Titanic Dataset: the list of Titanic passengers, indicating whether or not each one survived. More details [here](https://www.kaggle.com/c/titanic). "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import sklearn\n",
"import sklearn.preprocessing\n",
"import sklearn.feature_selection\n",
"import graphviz\n",
"\n",
"# Warn on chained assignment (this is also the pandas default).\n",
"pd.options.mode.chained_assignment = 'warn'\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | row.names | \n", "pclass | \n", "survived | \n", "name | \n", "age | \n", "embarked | \n", "home.dest | \n", "room | \n", "ticket | \n", "boat | \n", "sex | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1 | \n", "1st | \n", "1 | \n", "Allen, Miss Elisabeth Walton | \n", "29.0000 | \n", "Southampton | \n", "St Louis, MO | \n", "B-5 | \n", "24160 L221 | \n", "2 | \n", "female | \n", "
1 | \n", "2 | \n", "1st | \n", "0 | \n", "Allison, Miss Helen Loraine | \n", "2.0000 | \n", "Southampton | \n", "Montreal, PQ / Chesterville, ON | \n", "C26 | \n", "NaN | \n", "NaN | \n", "female | \n", "
2 | \n", "3 | \n", "1st | \n", "0 | \n", "Allison, Mr Hudson Joshua Creighton | \n", "30.0000 | \n", "Southampton | \n", "Montreal, PQ / Chesterville, ON | \n", "C26 | \n", "NaN | \n", "(135) | \n", "male | \n", "
3 | \n", "4 | \n", "1st | \n", "0 | \n", "Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) | \n", "25.0000 | \n", "Southampton | \n", "Montreal, PQ / Chesterville, ON | \n", "C26 | \n", "NaN | \n", "NaN | \n", "female | \n", "
4 | \n", "5 | \n", "1st | \n", "1 | \n", "Allison, Master Hudson Trevor | \n", "0.9167 | \n", "Southampton | \n", "Montreal, PQ / Chesterville, ON | \n", "C22 | \n", "NaN | \n", "11 | \n", "male | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1308 | \n", "1309 | \n", "3rd | \n", "0 | \n", "Zakarian, Mr Artun | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1309 | \n", "1310 | \n", "3rd | \n", "0 | \n", "Zakarian, Mr Maprieder | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1310 | \n", "1311 | \n", "3rd | \n", "0 | \n", "Zenn, Mr Philip | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1311 | \n", "1312 | \n", "3rd | \n", "0 | \n", "Zievens, Rene | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "
1312 | \n", "1313 | \n", "3rd | \n", "0 | \n", "Zimmerman, Leo | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1313 rows × 11 columns
\n", "\n", " | row.names | \n", "pclass | \n", "name | \n", "age | \n", "embarked | \n", "home.dest | \n", "room | \n", "ticket | \n", "boat | \n", "sex | \n", "
---|---|---|---|---|---|---|---|---|---|---|
1086 | \n", "1087 | \n", "3rd | \n", "Olsen, Master Arthur | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
12 | \n", "13 | \n", "1st | \n", "Aubert, Mrs Leontine Pauline | \n", "NaN | \n", "Cherbourg | \n", "Paris, France | \n", "B-35 | \n", "17477 L69 6s | \n", "9 | \n", "female | \n", "
1036 | \n", "1037 | \n", "3rd | \n", "Moubarek, Master William George | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
833 | \n", "834 | \n", "3rd | \n", "Gronnestad, Mr Daniel Danielsen | \n", "32.0 | \n", "Southampton | \n", "Foresvik, Norway Portland, ND | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1108 | \n", "1109 | \n", "3rd | \n", "Paulsson, Master Gosta Leonard | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1225 | \n", "1226 | \n", "3rd | \n", "Stankovic, Mr Jovan | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
658 | \n", "659 | \n", "3rd | \n", "Baclini, Miss Helene | \n", "NaN | \n", "Cherbourg | \n", "Syria New York, NY | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "
578 | \n", "579 | \n", "2nd | \n", "Watt, Miss Bertha | \n", "12.0 | \n", "Southampton | \n", "Aberdeen / Portland, OR | \n", "NaN | \n", "NaN | \n", "9 | \n", "female | \n", "
391 | \n", "392 | \n", "2nd | \n", "Dibden, Mr William | \n", "18.0 | \n", "Southampton | \n", "New Forest, England | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1044 | \n", "1045 | \n", "3rd | \n", "Murphy, Miss Margaret | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "
984 rows × 10 columns
\n", "\n", " | row.names | \n", "pclass | \n", "name | \n", "age | \n", "embarked | \n", "home.dest | \n", "room | \n", "ticket | \n", "boat | \n", "sex | \n", "
---|---|---|---|---|---|---|---|---|---|---|
1086 | \n", "1087 | \n", "3rd | \n", "Olsen, Master Arthur | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
12 | \n", "13 | \n", "1st | \n", "Aubert, Mrs Leontine Pauline | \n", "31.029621 | \n", "Cherbourg | \n", "Paris, France | \n", "B-35 | \n", "17477 L69 6s | \n", "9 | \n", "female | \n", "
1036 | \n", "1037 | \n", "3rd | \n", "Moubarek, Master William George | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
833 | \n", "834 | \n", "3rd | \n", "Gronnestad, Mr Daniel Danielsen | \n", "32.000000 | \n", "Southampton | \n", "Foresvik, Norway Portland, ND | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1108 | \n", "1109 | \n", "3rd | \n", "Paulsson, Master Gosta Leonard | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1225 | \n", "1226 | \n", "3rd | \n", "Stankovic, Mr Jovan | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
658 | \n", "659 | \n", "3rd | \n", "Baclini, Miss Helene | \n", "31.029621 | \n", "Cherbourg | \n", "Syria New York, NY | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "
578 | \n", "579 | \n", "2nd | \n", "Watt, Miss Bertha | \n", "12.000000 | \n", "Southampton | \n", "Aberdeen / Portland, OR | \n", "NaN | \n", "NaN | \n", "9 | \n", "female | \n", "
391 | \n", "392 | \n", "2nd | \n", "Dibden, Mr William | \n", "18.000000 | \n", "Southampton | \n", "New Forest, England | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "
1044 | \n", "1045 | \n", "3rd | \n", "Murphy, Miss Margaret | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "
984 rows × 10 columns
\n", "\n", " | row.names | \n", "pclass | \n", "name | \n", "age | \n", "embarked | \n", "home.dest | \n", "room | \n", "ticket | \n", "boat | \n", "sex | \n", "
---|---|---|---|---|---|---|---|---|---|---|
1086 | \n", "1087 | \n", "3rd | \n", "Olsen, Master Arthur | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1 | \n", "
12 | \n", "13 | \n", "1st | \n", "Aubert, Mrs Leontine Pauline | \n", "31.029621 | \n", "Cherbourg | \n", "Paris, France | \n", "B-35 | \n", "17477 L69 6s | \n", "9 | \n", "0 | \n", "
1036 | \n", "1037 | \n", "3rd | \n", "Moubarek, Master William George | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1 | \n", "
833 | \n", "834 | \n", "3rd | \n", "Gronnestad, Mr Daniel Danielsen | \n", "32.000000 | \n", "Southampton | \n", "Foresvik, Norway Portland, ND | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1 | \n", "
1108 | \n", "1109 | \n", "3rd | \n", "Paulsson, Master Gosta Leonard | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1225 | \n", "1226 | \n", "3rd | \n", "Stankovic, Mr Jovan | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1 | \n", "
658 | \n", "659 | \n", "3rd | \n", "Baclini, Miss Helene | \n", "31.029621 | \n", "Cherbourg | \n", "Syria New York, NY | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0 | \n", "
578 | \n", "579 | \n", "2nd | \n", "Watt, Miss Bertha | \n", "12.000000 | \n", "Southampton | \n", "Aberdeen / Portland, OR | \n", "NaN | \n", "NaN | \n", "9 | \n", "0 | \n", "
391 | \n", "392 | \n", "2nd | \n", "Dibden, Mr William | \n", "18.000000 | \n", "Southampton | \n", "New Forest, England | \n", "NaN | \n", "NaN | \n", "NaN | \n", "1 | \n", "
1044 | \n", "1045 | \n", "3rd | \n", "Murphy, Miss Margaret | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "0 | \n", "
984 rows × 10 columns
\n", "\n", " | row.names | \n", "pclass | \n", "name | \n", "age | \n", "embarked | \n", "home.dest | \n", "room | \n", "ticket | \n", "boat | \n", "sex | \n", "class_1st | \n", "class_2nd | \n", "class_3rd | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1086 | \n", "1087 | \n", "3rd | \n", "Olsen, Master Arthur | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
12 | \n", "13 | \n", "1st | \n", "Aubert, Mrs Leontine Pauline | \n", "31.029621 | \n", "Cherbourg | \n", "Paris, France | \n", "B-35 | \n", "17477 L69 6s | \n", "9 | \n", "female | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
1036 | \n", "1037 | \n", "3rd | \n", "Moubarek, Master William George | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
833 | \n", "834 | \n", "3rd | \n", "Gronnestad, Mr Daniel Danielsen | \n", "32.000000 | \n", "Southampton | \n", "Foresvik, Norway Portland, ND | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
1108 | \n", "1109 | \n", "3rd | \n", "Paulsson, Master Gosta Leonard | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1225 | \n", "1226 | \n", "3rd | \n", "Stankovic, Mr Jovan | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
658 | \n", "659 | \n", "3rd | \n", "Baclini, Miss Helene | \n", "31.029621 | \n", "Cherbourg | \n", "Syria New York, NY | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
578 | \n", "579 | \n", "2nd | \n", "Watt, Miss Bertha | \n", "12.000000 | \n", "Southampton | \n", "Aberdeen / Portland, OR | \n", "NaN | \n", "NaN | \n", "9 | \n", "female | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "
391 | \n", "392 | \n", "2nd | \n", "Dibden, Mr William | \n", "18.000000 | \n", "Southampton | \n", "New Forest, England | \n", "NaN | \n", "NaN | \n", "NaN | \n", "male | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "
1044 | \n", "1045 | \n", "3rd | \n", "Murphy, Miss Margaret | \n", "31.029621 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "female | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
984 rows × 13 columns
\n", "\n", " | age | \n", "sex | \n", "class_1st | \n", "class_2nd | \n", "class_3rd | \n", "
---|---|---|---|---|---|
1086 | \n", "31.029621 | \n", "1 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
12 | \n", "31.029621 | \n", "0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
1036 | \n", "31.029621 | \n", "1 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
833 | \n", "32.000000 | \n", "1 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
1108 | \n", "31.029621 | \n", "1 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
1225 | \n", "31.029621 | \n", "1 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
658 | \n", "31.029621 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
578 | \n", "12.000000 | \n", "0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "
391 | \n", "18.000000 | \n", "1 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "
1044 | \n", "31.029621 | \n", "0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "
984 rows × 5 columns
\n", "